Both Blinda and I have a fascination with economic data, and we were particularly interested in finding ways to apply the data mining techniques we have learned to the field of economics. In many cases, “economics” as an entire field seems far too daunting to attack as a whole. In some sense, though, that seems like a perfect reason to pick it for a data mining project! We could download a massive amount of tangentially related data and look for patterns or facts that we find interesting. In the end, our goal was to build some sort of algorithm or model that backs up the intuition we have developed learning economics at Columbia. Of course, much of modern economics comes down to clever new techniques for causal inference, like difference-in-differences and regression discontinuity, but we think some really interesting conclusions should be reachable with other tools.
As such, the audience for this paper is anyone who is interested in inflation forecasts and cares about their future cost of living. The topics covered in the coming pages should be completely accessible to undergraduates and to anyone with even a basic understanding of economic terms. While some of the features in the data may not be clear on their own, any that produce significant or interesting results will be explained. Take a term like M1 money: this is just economists’ fancy name for immediately available money, essentially the money sitting in checking accounts and reachable through debit cards or ATMs.
Because we didn’t have a particular focus for our economic exploration, we decided to gather as wide a range of data as we could, combine it into one large dataset, and then look for interesting paths to follow.
This strategy led us to “FRED”; not a person, but the Federal Reserve Bank of St. Louis, which provides an incredible amount of data straight from one of the 12 U.S. Federal Reserve banks. This data is about as direct a view of the economy as we can get, and it is commonly used by economic researchers. Also, thanks to their API, our research project can be updated simply by rerunning the markdown file to pull in any new data.
The beauty of using FRED is that there is not just one dataset but hundreds of different ones. Not knowing exactly what we wanted, we found we could use the API’s search function to find the most popular data series by number of calls! This let us lean on the experience of other researchers to see what data was popular and what we should be looking at.
Here is an example API call to FRED, searching the series catalog and requesting the 5 most popular series matching “federal funds”.
library(fredr)  # assumes an API key has already been registered via fredr_set_key()

# Search all FRED series for "federal funds" and return the 5 most popular
popular_funds_series <- fredr_series_search_text(
  search_text = "federal funds",
  order_by    = "popularity",
  sort_order  = "desc",
  limit       = 5
)
As a result, we ended up calling 27 different datasets and combining them. These datasets range widely in what they describe and wildly in their date ranges. They cover everything from the consumer price index, which estimates the average price of goods nationwide, to the number of vehicle sales. Larger macroeconomic series were pulled as well, including gross domestic product (GDP), the unemployment rate, and even bond prices. From prior knowledge, we classified the data into five categories:
Because the data comes from the same source, we assumed it would merge together nicely; however, this was not the case. The primary problem was the timing of data collection, which differs across most datasets. In general, data was collected daily, weekly, monthly, quarterly, or yearly. This means that not every data frame had the same dates to merge on, so the final data frame had a large number of NAs.
Additionally, not all data collection started in the same timeframe. For example, federal debt has existed far longer than some of the other metrics. As a result, some debt data goes all the way back to January of 1901 (through the API, at least), whereas other datasets start as recently as 1987.
In the end, though, through repeated use of merge we were able to build one complete dataset out of 27 different sources, with more than 16,000 rows.
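As a sketch of the merging step (with hypothetical toy series standing in for the real FRED pulls), each series arrives as a date–value data frame, and a full outer merge on date keeps every observation while leaving NAs at the mismatched dates:

```r
# Toy stand-ins for two FRED pulls with mismatched dates
gdp   <- data.frame(date = as.Date(c("2020-01-01", "2020-04-01")),
                    gdp = c(21.5, 19.5))
unemp <- data.frame(date = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01")),
                    unrate = c(3.6, 3.5, 4.4))

# all = TRUE performs a full outer join, so no dates are dropped;
# Reduce() chains the same merge across an arbitrary list of series
combined <- Reduce(function(x, y) merge(x, y, by = "date", all = TRUE),
                   list(gdp, unemp))

nrow(combined)        # 4 distinct dates survive
sum(is.na(combined))  # 3 NAs appear where series don't overlap
```

With 27 series in the list instead of two, the same `Reduce` call produces the single combined frame, NAs and all.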
The data as a whole is a beautiful combination of the most downloaded data sources and a variety of different gauges of the market and the United States as a whole.
We initially began by graphing a variety of sources to see how they compare. At first, I thought something was wrong with our data. Below is a graph of Federal Surplus or Deficit (FYFSD) and the US real gross domestic product (GDPC1).
No matter what we graphed against the US deficit, everything else appeared flat. In reality, this just shows the sheer size of the deficit the United States is running: 3 trillion dollars next to almost any other number in our dataset is massive. This will be important to remember later, to make sure this series isn’t having undue influence on our models.
When the other series are compared to each other, they look much more reasonable. For example, this graph shows the unemployment rate (UNRATE) and the federal funds rate (FEDFUNDS), which move in a similar, negatively related pattern and within a reasonable range.
Data like this is actually quite interesting, and the closer you look, the more it tells you. At first it may appear to be just noise, when in fact there is a real relationship at play. The federal funds rate appears to spike, immediately followed by the unemployment rate; in a sense, the unemployment rate trails the federal funds rate.
In reality, it’s not quite that simple. More often, when economic times are good the federal funds rate is rising, so that it can be lowered when the unemployment rate rises: it is a tool the Fed uses to help the economy. After a while, another recession occurs, which forces the rate back down to stimulate the economy, since keeping both unemployment and inflation under control is the Fed’s dual mandate.
In terms of data manipulation, this kind of graph could be difficult for something like regression to parse. The data is inherently time dependent, being time series data, and as a result may not fit well in a basic regression.
Other graphs showed other tightly intertwined relationships. There is not much closer than comparing GDP to GDPC1, or real GDP. Real GDP is inflation-adjusted GDP, meaning it shows production output after accounting for money slowly losing value. It was interesting to see that after 2012 GDP rose above real GDP (a GDP deflator, GDP/GDPC1, greater than one) and the gap between them has gradually widened since, showing that the price level has steadily increased relative to the base year.
Some of the results are interesting to see if they match our intuition. For example, personal savings rate and vehicle sales!
The relationship is much looser than I expected. Vehicle sales (in teal) look almost constant, with the exception of recessions, after which they immediately bounce back. The personal savings rate, however, has seen a steady decrease regardless, although the stimulus appears to have had a dramatic effect!
One of the immediate problems with the data was the somewhat random timing of when it was gathered. Some series are daily, some monthly, and some seemingly arbitrary. Others are quarterly (once every three months), but of course they aren’t always released on the first of the month, and because we merged the data daily, this makes it hard to run any algorithms on it.
As an example, the federal debt appears to be reported just once a year, in September, whereas other data is gathered every single day. The data therefore needed to be processed in such a way that we could work with it more easily. Our solution was to create summarized data frames: two frames grouped by month and by year, averaging the available values and recording NaN wherever there were no data points.
As a result, we had a yearly data table covering 34 years, from 1987 to 2021, and a monthly data frame covering the same span at a more granular level, with 1246 monthly data points.
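The monthly summarization can be sketched in base R (toy daily frame with hypothetical columns; our actual code operates on the big merged FRED frame): group by year-month, average whatever observations exist, and let empty months fall through as NaN.

```r
# Toy daily frame standing in for the merged dataset
df <- data.frame(date  = as.Date(c("2020-01-02", "2020-01-15", "2020-02-01")),
                 rate  = c(1.5, 1.7, NA),
                 sales = c(NA, 40, 42))

# Collapse to one row per month, averaging the observations that exist;
# a month where a series has no data at all comes out as NaN
df$month <- format(df$date, "%Y-%m")
monthly <- aggregate(cbind(rate, sales) ~ month, data = df,
                     FUN = mean, na.rm = TRUE, na.action = na.pass)
monthly
```

The same call with `format(df$date, "%Y")` produces the yearly table.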
To show how inconsistent some of the data was: simply cleaning the monthly data by keeping only months where every data series had at least one entry yields absolutely no months. So for the monthly data we took a different approach. To figure out what was going on, we graphed the number of missing values by series in the monthly data.
This yielded an interesting finding: there appear to be two types of data here, data collected quarterly and data collected monthly (or even more frequently). Then there are exceptions like total federal debt, which is collected once each year in September. The datasets needed to be segregated by collection frequency; those that didn’t line up well enough with the majority were moved to their own group. This left the monthly, weekly, and daily data in one table, known as the monthly data, while the quarterly data went into another data frame.
For intuition, I decided to run a regression on the yearly dataset, as it has a few additional features that we lose when picking quarterly or monthly data, so I figured it couldn’t hurt to try. As for what to predict, I thought the most interesting target would be inflation expectation. This is a metric from the University of Michigan’s Surveys of Consumers measuring how American consumers expect the prices of goods and services to change over the next 12 months; its value is the median expected price change, as reported in a survey of consumers. In a sense, this lets us approximate the effect of all the other features on Americans.
This is exciting because it acts as a proxy for how people feel about the economy, which we find a much more interesting question than what bond prices will be next year. And since the start of 2021, more and more people have been paying attention to a potential rise in inflation after the Fed’s unprecedented stimulus package. In a sense, inflation expectation lets us analyze whether people feel anxious or encouraged about the economy by looking at the economic indicators.
##
## Call:
## lm(formula = inflation_expectation ~ ., data = df_yearly_cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.145563 -0.040709 0.000795 0.044900 0.170767
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.569e+01 2.325e+01 -1.105 0.3057
## year 2.024e-03 1.834e-03 1.104 0.3062
## consumer_price_index_urban 4.908e-03 1.093e-01 0.045 0.9654
## ten_year_treasury -2.544e-01 7.424e-01 -0.343 0.7419
## ten_minus_two_treasury -4.532e-01 3.484e-01 -1.301 0.2345
## ten_minus_three_months_treasury 5.473e-02 8.207e-01 0.067 0.9487
## unemployment_rate -1.392e+00 5.168e-01 -2.693 0.0309 *
## real_gdp -1.147e-03 2.037e-03 -0.563 0.5909
## eff_fed_funds_rate -2.164e-01 6.187e-01 -0.350 0.7368
## gdp 4.179e-04 1.804e-03 0.232 0.8235
## thirty_year_fixed_mortgage -3.358e-01 5.074e-01 -0.662 0.5293
## m_two -3.400e-03 1.405e-03 -2.420 0.0461 *
## velocity_of_mtwo -9.242e+00 4.372e+00 -2.114 0.0724 .
## m_one 1.339e-03 5.590e-04 2.395 0.0478 *
## all_employees_minus_farmers -4.551e-04 2.930e-04 -1.553 0.1643
## aaa_corporate_bond_yield 8.332e-01 5.072e-01 1.643 0.1444
## personal_savings_rate -3.605e-02 8.213e-02 -0.439 0.6740
## personal_consumption_expenditures 4.257e-03 2.079e-03 2.048 0.0798 .
## industrial_production_index -1.421e-04 5.567e-02 -0.003 0.9980
## federal_debt_total 7.994e-08 7.471e-07 0.107 0.9178
## labor_force_participation_rate 1.538e+00 6.437e-01 2.389 0.0483 *
## consumer_price_index_nationwide 1.717e+00 7.822e-01 2.195 0.0642 .
## s_and_p 1.171e-01 7.062e-02 1.658 0.1412
## m_three NA NA NA NA
## federal_surplus_or_deficit -3.025e-07 8.374e-07 -0.361 0.7285
## house_price_index_nationwide -9.394e-02 4.795e-02 -1.959 0.0909 .
## total_vehicle_sales -2.854e-01 1.199e-01 -2.380 0.0489 *
## federal_debt_percent_of_gdp -3.299e-02 1.141e-01 -0.289 0.7808
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.156 on 7 degrees of freedom
## Multiple R-squared: 0.9743, Adjusted R-squared: 0.8789
## F-statistic: 10.21 on 26 and 7 DF, p-value: 0.00199
Unsurprisingly, there is not a ton of significance throughout, and there is likely high collinearity between many of these features. Still, it is interesting to look at what comes through. Of the features with a significance code, total vehicle sales and the M2 money supply seem the most impactful. M2 is a count of all money including cash and checking deposits (M1), plus money held in savings accounts, money market securities, mutual funds, and other time deposits. In a sense, it is a good gauge of how much money Americans actually have available to spend.
However, this initial approach has quite a few shortcomings. Normalization might help, and much of the data is highly correlated. High correlation, for example, can easily be identified by graphing:
Even from a brief glance, we can see that quite a few of our features are highly correlated. In a sense, this should be expected in a financial market where everything moves together. For example, the effective federal funds rate is designed to regulate many of the other features in our dataset, so it seems reasonable that they are correlated. Also, some features are derived from others, such as the federal deficit and the federal debt.
In fact, there are 7 features that have a correlation of over 0.99 with at least one other feature.
## [1] "m_three" "m_two"
## [3] "gdp" "personal_consumption_expenditures"
## [5] "consumer_price_index_urban" "ten_year_treasury"
## [7] "house_price_index_nationwide"
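Flagging such near-duplicates can be done directly from the correlation matrix; here is a minimal base-R sketch on hypothetical toy columns (column `b` is built as a near copy of `a`) rather than our actual frame:

```r
set.seed(1)
x  <- rnorm(100)
df <- data.frame(a = x,
                 b = x + rnorm(100, sd = 0.001),  # near-duplicate of a
                 c = rnorm(100))                  # independent noise

cm <- abs(cor(df))
diag(cm) <- 0  # ignore each feature's perfect correlation with itself

# Features whose strongest correlation with any other feature exceeds 0.99
flagged <- names(which(apply(cm, 1, max) > 0.99))
flagged  # "a" and "b"
```

Run over our merged frame, this is the kind of check that surfaced the 7 features listed above.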
We can get a partial view of some of these strong relationships through a further graph exploration of a subset of the data.
Of course, the immediate next step is to try ridge, lasso, and elastic net. These should “fix”, at least to some extent, the problem of multicollinearity.
To take a more granular, and more appropriate, approach, these techniques will be tried on the monthly data instead of the yearly.
The data will be split into train and test sets: the model is trained on the training data and evaluated on the test data. This helps us gauge robustness and see whether we are at risk of overfitting. An inherent problem when dealing with financial data is claiming that your model is predictive of the future. It is commonplace for any stock or financial advisor to say that past results are not indicative of future performance. Just because we can see a relationship between variables doesn’t mean it will remain true, and that is especially the case when there is a new kind of “shock” to the economy, like a global pandemic.
Our first new step is to normalize the data, in the hope of reducing the outsized effect that certain series could have when their scales are so much larger than others’.
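A minimal sketch of this normalize-then-split step (toy columns with deliberately mismatched scales; the variable names are illustrative, not our actual ones):

```r
set.seed(42)
# Toy stand-in for the monthly frame: one small-scale and one huge-scale column
df <- data.frame(x1 = rnorm(100, mean = 50, sd = 10),
                 x2 = runif(100, 0, 1e6))

# z-score every column so no series dominates purely through its units
scaled <- as.data.frame(scale(df))

# ~70/30 train/test split (note: a random split ignores the time ordering,
# which is a genuine caveat for time-series data like ours)
idx   <- sample(nrow(scaled), size = 0.7 * nrow(scaled))
train <- scaled[idx, ]
test  <- scaled[-idx, ]
```

After `scale()`, every column has mean 0 and standard deviation 1, so a coefficient of 1 means the same thing for every feature.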
##
## Call:
## lm(formula = inflation_expectation ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.10065 -0.34969 -0.02482 0.24613 2.23488
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.774e-12 3.469e-02 0.000 1.000000
## consumer_price_index_urban 6.328e+00 7.136e-01 8.868 < 2e-16 ***
## ten_year_treasury -1.860e+00 4.491e-01 -4.142 4.63e-05 ***
## ten_minus_two_treasury -1.917e-01 1.565e-01 -1.225 0.221646
## ten_minus_three_months_treasury -1.567e-02 2.808e-01 -0.056 0.955535
## unemployment_rate -9.710e-01 1.317e-01 -7.371 2.15e-12 ***
## eff_fed_funds_rate -1.243e-01 5.157e-01 -0.241 0.809798
## thirty_year_fixed_mortgage 1.921e+00 4.961e-01 3.873 0.000136 ***
## m_two -3.868e+04 2.981e+04 -1.297 0.195636
## m_one 2.238e-01 1.179e-01 1.897 0.058867 .
## all_employees_minus_farmers -5.015e+00 6.042e-01 -8.301 5.24e-15 ***
## aaa_corporate_bond_yield 1.398e+00 4.750e-01 2.943 0.003537 **
## personal_savings_rate 7.948e-02 1.188e-01 0.669 0.504036
## personal_consumption_expenditures 3.995e-01 1.053e+00 0.379 0.704626
## industrial_production_index 8.192e-01 3.297e-01 2.485 0.013579 *
## labor_force_participation_rate 3.401e-01 2.694e-01 1.262 0.207919
## consumer_price_index_nationwide 3.069e-01 3.847e-02 7.978 4.45e-14 ***
## s_and_p 3.955e-01 1.639e-01 2.414 0.016460 *
## m_three 3.868e+04 2.981e+04 1.297 0.195662
## total_vehicle_sales -4.769e-02 8.302e-02 -0.574 0.566146
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5866 on 266 degrees of freedom
## Multiple R-squared: 0.6788, Adjusted R-squared: 0.6559
## F-statistic: 29.59 on 19 and 266 DF, p-value: < 2.2e-16
These are more exciting results! Using the scaled monthly data, we get some very significant coefficients with non-trivial estimates. The things that appear to affect inflation expectation are: how the cost of a basket of goods changes nationwide, the ten-year treasury yield, the unemployment rate, the 30-year fixed mortgage rate, and the number of employees in the US. As one would expect, the treasury yield, the unemployment rate, and the number of employees are all inversely correlated with it. According to economic theory, inflation and unemployment maintain an inverse relationship, as represented by the Phillips curve: prices won’t be forced up while there are still people who want work, and when unemployment is low, more consumers have discretionary income with which to purchase goods. And when bond yields are very high, bonds are less desirable, which signals less purchasing power in the market.
As for the linear model itself, it is not terrific, but it is also not as poor a model as one might expect. Our \(R^2\) values are above 0.65 and our RMSE is within reason.
Running on train data:
## [1] "Adjusted R^2" "0.66"
## [1] "RMSE" "0.57"
Rerunning on test data:
## [1] "Adjusted R^2" "0.66"
## [1] "RMSE" "0.78"
\(R^2\) remains the same and RMSE increases modestly on the test data. Still not wonderful, but better than I would naturally expect!
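For reference, the two metrics reported above are computed as follows (a base-R sketch on toy data; the variable names are illustrative):

```r
set.seed(5)
x <- rnorm(50)
y <- 2 * x + rnorm(50)   # toy target with a known linear signal

fit  <- lm(y ~ x)
pred <- predict(fit)

# Root mean squared error of the fitted values against the truth
rmse   <- sqrt(mean((y - pred)^2))
# Adjusted R^2 comes straight from the model summary
adj_r2 <- summary(fit)$adj.r.squared

c(rmse = round(rmse, 2), adj_r2 = round(adj_r2, 2))
```

On the test set, the same RMSE formula is applied to `predict(fit, newdata = test)` instead of the in-sample fitted values.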
## [1] 286 19
## [1] 123 19
Initially, because of the extreme correlation in some features, I expected lasso to be the technique we would eventually settle on, but I figured we might as well try ridge too.
## [1] 0.001
The optimal lambda comes out at 0.001. I’m immediately skeptical of this value, and it may require more exploration to see whether there are optimizations to improve it.
## [1] 0.001
## [1] 0.001
## [1] 0.003162278
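For intuition on what a near-zero optimal lambda means, here is a self-contained base-R sketch (toy data and a closed-form ridge solution rather than the glmnet fit we actually ran): when the penalty shrinks toward zero, the ridge coefficients converge to the plain OLS ones, which is exactly why our regularized fits barely differ from the earlier regression.

```r
set.seed(7)
n <- 200
X <- scale(matrix(rnorm(n * 3), n, 3))
y <- X %*% c(1, -2, 0.5) + rnorm(n, sd = 0.1)  # known true coefficients

# Closed-form ridge coefficients: (X'X + n*lambda*I)^{-1} X'y
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  solve(crossprod(X) + nrow(X) * lambda * diag(p), crossprod(X, y))
}

ols   <- ridge_coef(X, y, 0)     # lambda = 0 is exactly OLS
tiny  <- ridge_coef(X, y, 1e-3)  # near-zero penalty, like our tuned lambda
heavy <- ridge_coef(X, y, 10)    # heavy penalty shrinks everything toward 0

max(abs(ols - tiny))   # negligible: tiny lambda ~ OLS
```

So a tuned lambda of 0.001 is the model telling us the penalty isn’t buying much here.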
Taking a step back to look at what we are estimating: the expected values are effectively between 0 and 6, representing inflation in the immediate future. Where our modeling may really be letting us down is the fact that, the majority of the time, you can guess 2.5 to 3.5% inflation and be right.
## glmnet
##
## 409 samples
## 19 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 327, 327, 328, 327, 327
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.10 0.0003948395 0.3732175 0.5602756 0.2690492
## 0.10 0.0039483948 0.4043822 0.4873458 0.2768239
## 0.10 0.0394839484 0.4509953 0.3678330 0.3088071
## 0.55 0.0003948395 0.3723209 0.5620107 0.2692984
## 0.55 0.0039483948 0.4031312 0.4916440 0.2750844
## 0.55 0.0394839484 0.4800511 0.2835748 0.3331703
## 1.00 0.0003948395 0.3702357 0.5667271 0.2689143
## 1.00 0.0039483948 0.3954760 0.5125016 0.2718174
## 1.00 0.0394839484 0.4962108 0.2336807 0.3458847
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.0003948395.
We end up with very similar results: because alpha is 1, this is practically the same algorithm as above, and with such a small lambda there is little difference between the ordinary least squares objective and the lasso/ridge ones.
For financial data, PCA provides an interesting way to group the features into new components. This should create a few really meaningful components out of our current variety of correlated features.
There appears to be a noticeable drop-off at 5 components. While it is not immediately clear to us what each component represents in reality, an economics expert could likely pick out these features.
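The mechanics of this step can be sketched with `prcomp` on toy data (hypothetical columns built from two shared latent factors, mimicking our correlated features; not our actual frame):

```r
set.seed(3)
n  <- 300
f1 <- rnorm(n)   # latent "factor 1", shared by a and b
f2 <- rnorm(n)   # latent "factor 2", shared by c and d
df <- data.frame(a = f1 + rnorm(n, sd = 0.1),
                 b = f1 + rnorm(n, sd = 0.1),
                 c = f2 + rnorm(n, sd = 0.1),
                 d = f2 + rnorm(n, sd = 0.1),
                 e = rnorm(n))                 # pure noise

# scale. = TRUE matters: otherwise large-unit series dominate the components
pca <- prcomp(df, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 2)  # variance concentrates in the first few components
```

Plotting `var_explained` is the scree plot from which we read off the drop-off.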
At first, clustering didn’t make much sense for such a wide variety of data. After all, what ways could we describe the entire economy of the United States?
Well, looking at the data graphed on FRED’s website, it is clear there is one major factor the government watches for: are we in a recession? In fact, it might be possible to pull the different phases of the economic cycle out of clustering. There should be characteristic markings in the data when the economy is in a recession versus when it is booming; the question is whether those markings are clear.
At first, the distribution of groups actually looked decent. Using three phases, growth, recession, and a sort of stagnation, we were hoping to capture the business cycle. However, running it on the original datasets led to three completely chronological groups. Factors like debt, GDP, and the deficit were likely too much for the clustering to overcome.
Even after scaling the data, chronological groups were all that clustering could pull out.
In future exploration, it would be wise to look at the derivatives of the features. Looking at the change in GDP, for example, would be a promising way to group the data.
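That suggestion can be sketched as follows (base R, on a single hypothetical trending series): k-means on raw levels just slices time into consecutive blocks, while k-means on the period-over-period change (`diff`) is free to group by behavior instead.

```r
set.seed(9)
t <- 1:120
# Toy series with a strong upward trend, loosely GDP-like
gdp_like <- 100 + 0.5 * t + rnorm(120, sd = 0.2)

# Count how many contiguous blocks the cluster labels form over time
runs <- function(labels) sum(diff(labels) != 0) + 1

# Clustering the raw levels: the trend forces chronological clusters
raw_clusters <- kmeans(scale(gdp_like), centers = 3)$cluster

# Clustering the changes: the trend is removed, clusters mix across time
chg          <- diff(gdp_like)
chg_clusters <- kmeans(scale(chg), centers = 3)$cluster

c(raw_runs = runs(raw_clusters), chg_runs = runs(chg_clusters))
```

On the levels the labels form just a handful of contiguous time blocks, exactly the chronological grouping we saw; on the differenced series they switch constantly, which is what a regime-based grouping would need.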
Here we introduce population data from the census, in order to look at per capita figures and hopefully surface some new insights. The US Census API actually has data for the entire world, which is fascinating!
One reason clustering might not be working well is the effect that population growth is having on the country. There are undoubtedly differences between the last few decades, and clustering is pulling those out through the numbers.
Out of caution, this will only be applied to the yearly data, since yearly is the granularity at which the census figures are available. In the future, one could interpolate the population linearly, assuming an even distribution of birthdays throughout the year, and that should be close.
The reason to create this type of feature is that it is a much more helpful way of standardizing some of these values. For example, car sales will naturally climb as the population grows; but if sales climb per capita, that is a good indicator of inflationary pressure and expectations. House prices could be viewed in a similar way.
From our initial regression, we know that M2 money was important. However, this could be misleading, since there should simply be more money when there are a lot more people, so the per capita view helps confirm whether M2 truly is significant.
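The per capita features themselves are just a division by population; a minimal sketch with hypothetical round numbers (not the actual census or FRED values):

```r
# Toy annual totals plus census population; all figures are illustrative
df <- data.frame(year       = 2018:2020,
                 population = c(326e6, 328e6, 330e6),
                 m_two      = c(14.4e12, 15.3e12, 19.1e12),
                 car_sales  = c(17.2e6, 17.0e6, 14.6e6))

# Per capita versions: does the quantity grow faster than the population does?
df$m_two_percap     <- df$m_two / df$population
df$car_sales_percap <- df$car_sales / df$population

df[, c("year", "m_two_percap", "car_sales_percap")]
```

The same division is applied to each yearly series before rerunning the regression and the clustering.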
Unfortunately, there doesn’t appear to be much of a difference in the S&P, to say the least. Let’s see how the regression turns out.
##
## Call:
## lm(formula = inflation_expectation ~ ., data = scaled_percap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.79322 -0.37557 -0.09179 0.43219 1.92436
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.335e-15 1.204e-01 0.000 1.0000
## car_sales_percap 7.811e-02 4.816e-01 0.162 0.8725
## house_price_percap -2.544e+00 2.554e+00 -0.996 0.3290
## s_and_p_percap 2.885e+00 2.216e+00 1.302 0.2054
## m_two_percap -1.229e+00 8.234e-01 -1.492 0.1486
## consumer_price_index_percap 3.784e+00 1.785e+00 2.120 0.0446 *
## aaa_bond_percap 1.909e+00 9.339e-01 2.044 0.0521 .
## all_empoyees_percap -6.750e-01 3.977e-01 -1.697 0.1026
## personal_consumption_expenditures -5.554e-01 1.941e+00 -0.286 0.7772
## eff_fed_funds_rate 7.447e-01 3.455e-01 2.156 0.0414 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.702 on 24 degrees of freedom
## Multiple R-squared: 0.6416, Adjusted R-squared: 0.5072
## F-statistic: 4.774 on 9 and 24 DF, p-value: 0.001032
However, even scaled for population, clustering fails to recognize anything other than time, and the simplified per capita regression provided no additional information compared to the full model. Some of this may come down to the nature of inflation expectations: the sheer number of people doesn’t necessarily affect inflation. Rather, figures tied to the number of people actively consuming goods, such as those entering retirement or immigrating, may be much more representative. This is supported by our initial regression showing that the unemployment rate and people’s disposable money are strongly related to expected inflation.
We went through the following process to guard against mistaking chance correlation for signal, and we are confident that comparing different models is a valid way to limit data dredging. The analysis was conducted with k-fold cross-validation to assess model performance. Then, with test error higher than training error, which probably indicates overfitting, we introduced regularization (lasso and ridge) to reduce the effective number of free parameters. However, the optimal lambda turned out to be close to zero, so lasso and ridge produce coefficients close to the classical least squares ones. Nevertheless, there is a serious risk of “p-hacking”, or data dredging, in explorations like those outlined above: trying a wide variety of normalizations and explorations of the data increases the risk of finding spurious correlations, results that are due only to chance. It is therefore important to remember that these results are exploratory and shouldn’t be used for inference without more research.
There are also some limitations in our dataset. Missing factors and unmeasured confounders may bias our predictions of inflation expectation. The variables we imported are constructed around empirically motivated characteristics of the economy, and those numbers are captured by experts or big data; inflation expectation, however, even at the aggregate level, is measured by a survey of consumer sentiment. The consumer’s learning process, and the role of trust in economic variables, are not considered in the model.
In summary, this whole project started by mining economic variables from the FRED database. Then, through cleaning, identifying, and learning more about the data, we attempted to discover and analyze patterns related to inflation expectation. We tried different combinations of predictors and models, including OLS, scaling, lasso, ridge, clustering, and an added population feature, and evaluated each for feasibility. Ultimately, we believe the OLS model on scaled monthly data offers the best mix of accuracy and interpretability. Our regression suggests that inflation expectation has the strongest relationships with current inflation, the ten-year treasury yield, the unemployment rate, the 30-year fixed mortgage rate, and the number of employees in the US. Among these key variables, the treasury yield, the unemployment rate, and the number of employees are negatively correlated with it, a relationship that aligns with economic theory and market intuition. How people think about future inflation comes from how they currently feel about the prices of goods and the overall long-term cost of financing and living. The mortgage rate, though less prominent in economic theory, carries a meaning similar to the ten-year treasury yield: it reflects the long-term cost of financing. Lower rates imply a lower cost of living and thus higher purchasing power in the future; and with low unemployment, more consumers have discretionary income to purchase goods, so the prices of goods rise.